Abstract

In this paper we study the Prefix Sum problem introduced by Fredman.
We show that it is possible to perform both update and retrieval
in $O(1)$ time simultaneously under a memory model in which individual
bits may be shared by several words. We also show that two
variants (generalizations) of the problem can be solved optimally in
$O(\lg N)$ time under the comparison based model of computation.



1 Introduction

In this paper we discuss solutions to variants of the Prefix Sum problem

(i.e. finding the sum of the first j elements in an array and also updating 
these values) which was introduced by Fredman [5]. Various lower bounds 
have been proven for the problem. We, however, focus on the problem 
under a nonstandard, though very feasible, model to achieve a constant 
time solution. In particular, we focus primarily on the so-called RAMBO
model of computation (Random Access Machine with Byte Overlap), an
extension of the random access machine (RAM) in which a bit can be in
several words. This model was first suggested by Fredman and
Saks [6] and further described and used by Brodnik et al. [3,4].

Fredman and Saks actually suggested the RAMBO model in connection 
with the Prefix Sum problem. They claim, with no hint of how it may 
be done, that Prefix Sum mod 2 can be solved in constant time under the 
model. We show how this can be done not only for Prefix Sum mod 2 but 
for Prefix Sum modulo an arbitrary universe size.

The RAMBO model, besides the usual RAM operations (cf. [16]), also
has a part of memory where a bit may occur in several registers or in several 
positions in one register. The way the bits occur in this part of the memory 
has to be specified as part of the model. One example of such a memory 
variant is a square of bits with n rows and n columns. An n-bit word can
be fetched either as a row or a column. In such a memory each bit can be 
accessed either by the row word or the column word. 

Brodnik et al. [4] use a variant of RAMBO, referred to as the Yggdrasil 
variant, to solve the Priority Queue problem in $O(1)$ worst case time. That
variant has been implemented in hardware [11] and the actual rerouting 
of the bits on a word fetch is not difficult. In this paper we modify the 
Yggdrasil variant slightly and solve the Prefix Sum problem. This gives 
further evidence of the value of such an architecture, at least for a special 
purpose processor. 

Now let us formally define the Prefix Sum problem: 

Definition 1 The Prefix Sum problem is to maintain an array, A, of size 
N, and to support the following operations: 

Update(j, $\Delta$): $A(j) := A(j) + \Delta$

Retrieve(j): return $\sum_{i=0}^{j} A(i)$

where $0 \le j < N$.

Fredman showed that, under the comparison based model of computation,
an $O(\lg N)$ solution exists for the Prefix Sum problem [5].

The problem can be generalized in several ways and we start by adding
another parameter, k, to the Retrieve operation. This parameter is used to
tell the starting point of the array interval to sum over. Hence, Retrieve(k, j)
returns $\sum_{i=k}^{j} A(i)$, where $0 \le k \le j < N$. This variant is usually referred to
as the Partial Sum or Range Sum problem. The Partial Sum problem can
be solved using a solution to the Prefix Sum problem (Retrieve(k, j) =
Retrieve(j) - Retrieve(k-1), with Retrieve(-1) taken to be 0). In fact, the
two problems are often used interchangeably.
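The reduction from Partial Sum to Prefix Sum can be sketched in Python as follows. This is only an illustration: the names `NaivePrefixSum` and `range_sum` are hypothetical, plain integers are used instead of a bounded universe, and the naive linear-time retrieval merely stands in for any Prefix Sum structure.

```python
class NaivePrefixSum:
    """Reference implementation of Definition 1 (O(1) update, O(N) retrieve)."""

    def __init__(self, size):
        self.values = [0] * size

    def update(self, j, delta):
        # Update(j, delta): A(j) := A(j) + delta
        self.values[j] += delta

    def retrieve(self, j):
        # Retrieve(j): sum of A(0..j); Retrieve(-1) is the empty sum, i.e. 0.
        return sum(self.values[: j + 1])


def range_sum(ps, k, j):
    """Retrieve(k, j) = Retrieve(j) - Retrieve(k - 1)."""
    return ps.retrieve(j) - ps.retrieve(k - 1)


ps = NaivePrefixSum(8)
for index, value in enumerate([3, 1, 4, 1, 5, 9, 2, 6]):
    ps.update(index, value)
print(range_sum(ps, 2, 5))  # sums A(2) + A(3) + A(4) + A(5)
```

Any structure offering the same `update` and `retrieve` methods could be passed to `range_sum` unchanged.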

Furthermore, there is no obvious reason to only allow addition in the
Update and Retrieve operations. We can allow any binary function, $\oplus$, to
be used. In fact we can allow the Update operation to use one function, $\oplus_u$,
and the Retrieve operation to use another function, $\oplus_r$. We will refer to
this variant of the problem as the General Prefix Sum problem.

Moreover, one can allow array positions to be inserted at or deleted from
arbitrary places. Hence, we can have sparse arrays, e.g. an array where only
A(5) and A(500) are present. Positions which have not yet been added or
have been deleted have the value 0. We refer to this variant as the Dynamic
Prefix Sum problem. Brodnik and Nilsson [13, pp 65-80] describe a data
structure they call a BinSeT tree which can be modified slightly to support
all operations of the Dynamic Prefix Sum problem in $O(\lg N)$ time. Another
generalization is to use multidimensional arrays and this variant has been
studied by the database community [2, 7, 8, 10, 14, 15].

Several lower bounds have been presented for this problem: Fredman
showed an $\Omega(\lg N)$ algebraic complexity lower bound and an $\Omega(\lg N / \lg \lg N)$
information-theoretic lower bound [5]. Yao [17] has shown that $\Omega(\lg N / \lg \lg N)$
is an inherent lower bound under the semi-group model of computation and
this was improved by Hampapuram and Fredman to $\Omega(\lg N)$ [9]. We sidestep
these lower bounds by considering the RAMBO model of computation.

As with all RAM-based models, we need to restrict the size of a word
which can be stored and operated on. We denote the word size with b and 
assume that b = 2°^ which is true for most computers today. A bounded 
word size also implies a bounded universe of elements that we store in the 
array. We use M to denote the universe size. Hence all operations $\oplus$ have
to be computed modulo M and we require that each of the operands and 
the result are stored in one word. 

We will use n and m to denote $\lceil \lg N \rceil$ and $\lceil \lg M \rceil$ respectively. Hence,
$N \le 2^n$ and $M \le 2^m$. Both n and m are less than or equal to b ($n, m \le b$).
In one of the solutions we actually require that $nm \le b$.

In Sect. 2 we show an $O(1)$ solution to the Prefix Sum problem under the
RAMBO model using a modified Yggdrasil variant. In Sect. 3 we discuss
an $O(\lg N)$ solution to the General and Dynamic Prefix Sum problems and
finally conclude the paper with some open questions in Sect. 4.

2 An O(1) Solution to the Prefix Sum Problem

In our $O(1)$ solution to the Prefix Sum problem we use a complete binary
tree on top of the array (Fig. 1). We label the nodes in standard heap order,
i.e., the root is node $v_1$ and the left and right children of a node $v_i$ are $v_{2i}$
and $v_{2i+1}$ respectively. In each node we store m bits representing the sum
of the leaves in the left subtree. Since we build a complete binary tree on
top of the array we assume that $N = 2^n$ (if this is not true we still build
the complete tree and in the worst case waste space proportional to $N/2 - 1$).
We do not store the original array A since its values are stored implicitly in
the tree. The only value not stored in the tree (if $N = 2^n$ only) is $A(N - 1)$
and we store this value explicitly (vnl). Formally we define:

Definition 2 An N-m-tree is a complete binary tree with N leaves in which
the internal nodes ($v_i$) store an m-bit value. In addition, an m-bit value is
stored separately (vnl).

To update A(j) (Algorithm 1) in this structure we have to update all
the nodes on the path from leaf j to the root in which j belongs to the left
subtree. To Retrieve(j) (Algorithm 2) we need to sum the values of all
the nodes on the path from leaf j + 1 to the root in which j + 1 belongs to the
right subtree. Note that the path corresponding to array position j starts
at node $v_{N/2 + \lfloor j/2 \rfloor}$.

Figure 1: Complete binary tree on top of A. Nodes store the sum of
the values in the leaves covered by the left subtree.



update(j, Δ)
  if (j == N-1)
    vnl = vnl + Δ;
  else
    i = N + j;
    while (i > 1)
      next = i div 2;
      if (i mod 2 == 0)
        v[next] = (v[next] + Δ) mod M;
      i = next;

Algorithm 1: Updating an N-m-tree in $O(\lg N)$ time.
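The logarithmic-time update and retrieve procedures described above can be sketched in runnable form as follows. This is a minimal Python sketch, assuming N is a power of two; the class name `NmTree` and its attribute names are hypothetical, and all sums are reduced modulo M.

```python
class NmTree:
    """N-m-tree: node v[i] holds the sum (mod M) of the leaves in its left subtree."""

    def __init__(self, n_leaves, modulus):
        assert n_leaves >= 2 and n_leaves & (n_leaves - 1) == 0, "N must be a power of two"
        self.N = n_leaves
        self.M = modulus
        self.v = [0] * n_leaves  # v[1..N-1] are the internal nodes; v[0] is unused
        self.vnl = 0             # A(N-1), stored outside the tree

    def update(self, j, delta):
        if j == self.N - 1:
            self.vnl = (self.vnl + delta) % self.M
            return
        i = self.N + j           # heap index of leaf j
        while i > 1:
            parent = i // 2
            if i % 2 == 0:       # we came up from the left child
                self.v[parent] = (self.v[parent] + delta) % self.M
            i = parent

    def retrieve(self, j):
        if j == self.N - 1:
            total, i = self.vnl, self.N + j
        else:
            total, i = 0, self.N + j + 1
        while i > 1:
            parent = i // 2
            if i % 2 == 1:       # we came up from the right child
                total = (total + self.v[parent]) % self.M
            i = parent
        return total
```

Both methods walk one leaf-to-root path, so each performs n = lg N iterations.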



The method described above implies an $O(\lg N)$ update and retrieval time
in the RAM model. To achieve constant time update and retrieval we use
a variant of the RAMBO model similar to the Yggdrasil variant. In the
Yggdrasil variant, registers overlap as paths from leaf to root in a complete
binary tree with one bit stored in each internal node [4]. We generalize the
Yggdrasil variant and let it store m bits in each node and call this variant
m-Yggdrasil. In any m-Yggdrasil, register reg[i] corresponds to the path
from node $v_{N/2+i}$ to the root of the tree. Each register consists of $nm \le b$
bits. In total the m-Yggdrasil registers need $(N - 1) \cdot m$ bits.

Now, we use the registers from m-Yggdrasil to store the nodes of our
tree. The path corresponding to array position j is stored in reg[j/2] and 
hence all nodes along the path can be accessed at once. 
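Since ordinary memory has no overlapping registers, the behaviour of an m-Yggdrasil register can only be simulated in software. The following Python sketch gathers and scatters the node values along a path. The function names are hypothetical, and the bit layout (the node on level i occupies bits im to (i+1)m - 1) is an assumption chosen to agree with the shift by i*m used in the update procedure of this paper.

```python
def path_nodes(reg_index, n):
    """Heap indices of the nodes covered by reg[reg_index], lowest level first."""
    node = (1 << (n - 1)) + reg_index   # v_{N/2 + reg_index}
    nodes = []
    while node >= 1:
        nodes.append(node)
        node //= 2
    return nodes


def read_reg(v, reg_index, n, m):
    """Pack the m-bit node values on the path into one nm-bit word."""
    word = 0
    for level, node in enumerate(path_nodes(reg_index, n)):
        word |= v[node] << (level * m)
    return word


def write_reg(v, reg_index, n, m, word):
    """Scatter an nm-bit word back into the nodes on the path."""
    field = (1 << m) - 1
    for level, node in enumerate(path_nodes(reg_index, n)):
        v[node] = (word >> (level * m)) & field
```

Writing through one register changes what neighbouring registers read, because their paths share the upper nodes; in the hardware model this sharing costs no extra time.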

We let levels of the tree be counted from the internal nodes above the
leaves, starting at 0 and ending with $n - 1$ at the root. If the ith bit of j
is 1 then j is in the right subtree of the node on level i of the path, and in
the left otherwise. Hence j can be used to determine which nodes along the
path should be updated (nodes corresponding to bits of j that are 0) and
which nodes should be used when retrieving a sum (nodes corresponding to
bits of j that are 1).

retrieve(j)
  if (j == N-1)
    sum = vnl;
    i = N + j;
  else
    sum = 0;
    i = N + j + 1;
  while (i > 1)
    next = i div 2;
    if (i mod 2 == 1)
      sum = (sum + v[next]) mod M;
    i = next;
  return sum;

Algorithm 2: Retrieve in an N-m-tree in $O(\lg N)$ time.

When updating the m-Yggdrasil registers (Algorithm 3), for all bits of
j, if the ith bit of j is 0 we add $\Delta$ to the value of the ith node along the
path from j to the root. To do this we shift $\Delta$ to the corresponding position
($\Delta \ll (im)$) and add it to reg[j/2]. Instead of checking whether the ith bit
of j is 0 we can mask the shifted $\Delta$ with a value based on NOT j. The value
consists of m 1s shifted to the correct position if the ith bit of NOT j is 1,
and m 0s otherwise.



update(j, Δ)
  if (j == N-1)
    vnl = vnl + Δ;
  else
    for (i = 0; i < n; i++)
      if (((j >> i) and 1) == 0)
        reg[j/2] = reg[j/2] + (Δ << (i*m));

Algorithm 3: Updating an N-m-tree stored in m-Yggdrasil memory
($O(\lg N)$ time).



Actually, as long as the binary operation only affects the m bits that 
should be updated we can use word-size parallelism (cf. [3]) and perform 
the update of all nodes in parallel. In Sect. 2.1 we show that addition 
modulo M can be implemented affecting only m bits. 

We use two functions, dist(i) and mask(i), to simplify the description
of the update and retrieve methods. The function dist(i), $0 \le i < 2^m$,
computes nm-bit values. The values are n copies of the m bits in i. For
example, given m = 3, n = 4, dist(010) is 010010010010. The function
mask(i), $0 \le i < 2^n$, also computes nm-bit values. These values are
computed as follows: bit j ($0 \le j < n$) of i is copied to bits $jm, \ldots, (j+1)m - 1$.
For example, given m = 3, n = 4, mask(1001) is 111000000111. Both these
functions can be implemented by using word-size parallelism [3].
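The two helper functions can be illustrated with a short Python sketch. It reproduces the definitions and the two examples above but makes no claim to be the constant-time word-parallel implementation of [3]: `dist` uses one multiplication by a replication constant, while `mask` simply loops over the bits. Passing n and m as explicit parameters is a choice made for this sketch.

```python
def dist(i, n, m):
    """n copies of the m-bit value i, packed into an nm-bit integer."""
    # (2^(nm) - 1) / (2^m - 1) = 1 + 2^m + 2^(2m) + ... has a 1 at the
    # bottom of every m-bit field, so multiplying by it replicates i.
    replicate = ((1 << (n * m)) - 1) // ((1 << m) - 1)
    return i * replicate


def mask(i, n, m):
    """Copy bit j of the n-bit value i to bits jm .. (j+1)m - 1."""
    field = (1 << m) - 1
    word = 0
    for bit in range(n):
        if (i >> bit) & 1:
            word |= field << (bit * m)
    return word


print(format(dist(0b010, 4, 3), "012b"))   # 010010010010
print(format(mask(0b1001, 4, 3), "012b"))  # 111000000111
```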

We can update the tree in constant time using the procedure in Algorithm 4.
First we make n copies of $\Delta$ and then mask out the copies we need.
Finally we add the masked, distributed $\Delta$ to reg[j/2] and
store the result in reg[j/2]. For the case when $j = N - 1$ we simply add
$\Delta$ to vnl and store the result in vnl. This gives us the following lemma:

Lemma 1 The update operation of the Prefix Sum problem can be supported
in $O(1)$ time when parts of the N-m-tree are stored in an m-Yggdrasil memory.



update(j, Δ)
  if (j == N-1)
    vnl = vnl + Δ;
  else
    reg[j/2] = reg[j/2] + (dist(Δ) and mask(NOT j));

Algorithm 4: Updating an N-m-tree stored in m-Yggdrasil memory using
word-size parallelism ($O(1)$ time).
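The effect of Algorithm 4 on the contents of one register can be traced with the following Python sketch. The function names are hypothetical, and `fieldwise_add_mod` is a plain per-field loop standing in for the word-parallel addition modulo M of Sect. 2.1, so the sketch shows the result of the update rather than its constant running time.

```python
def dist(i, n, m):
    """n copies of the m-bit value i."""
    return i * (((1 << (n * m)) - 1) // ((1 << m) - 1))


def mask(i, n, m):
    """m ones in field j for every set bit j of i."""
    field = (1 << m) - 1
    return sum(field << (bit * m) for bit in range(n) if (i >> bit) & 1)


def fieldwise_add_mod(x, y, n, m, M):
    """Add two packed words field by field modulo M (loop stand-in)."""
    field = (1 << m) - 1
    out = 0
    for level in range(n):
        shift = level * m
        total = (((x >> shift) & field) + ((y >> shift) & field)) % M
        out |= total << shift
    return out


def update_word(reg_word, j, delta, n, m, M):
    """New contents of reg[j/2] after update(j, delta), for j < N - 1."""
    not_j = ~j & ((1 << n) - 1)                    # NOT j restricted to n bits
    addend = dist(delta, n, m) & mask(not_j, n, m)
    return fieldwise_add_mod(reg_word, addend, n, m, M)
```

For example, with n = m = 3 and j = 010 in binary, only the fields on levels 0 and 2 receive the update, since those are the levels where j lies in a left subtree.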



To support the retrieve method in constant time we use a table SUM[i],
$0 \le i < 2^{nm}$, with m-bit values that are the sum modulo M of the n m-bit
values in i.

To retrieve the sum (Algorithm 5) we read the register reg corresponding 
to j and mask out the parts we need. Then we use the table SUM to calculate 
the sum. Finally, we add vnl to the sum if $j = N - 1$.



retrieve(j)
  if (j == N-1)
    v = reg[j/2] and mask(j);
  else
    v = reg[(j+1)/2] and mask(j+1);
  sum = SUM[v];
  if (j == N-1)
    sum = vnl + sum;
  return sum;

Algorithm 5: Retrieve in an N-m-tree stored in m-Yggdrasil memory using
word-size parallelism ($O(1)$ time).



The space needed by the table SUM is $2^{nm} \cdot m = N^{\lg M} \cdot m = M^{\lg N} \cdot m$ bits,
which is rather large. In order to reduce the space requirement we can
reduce, by half, the number of bits used as index into the table. This gives
us a space requirement of $\sqrt{M^{\lg N}} \cdot m$ bits. We do this by shifting the top n/2
m-bit values from reg down and computing the sum modulo M of these
values and the bottom n/2 values. Then this new (n/2)m-bit value is used
as index into SUM instead.
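As a worked instance of these space bounds, take the illustrative values N = 16 and M = 8 (chosen here for the example only), so that n = 4 and m = 3:

```latex
\begin{align*}
2^{nm} \cdot m   &= 2^{12} \cdot 3 = 12288 \text{ bits}
  && \text{(full table, } M^{\lg N} = 8^{4} = 4096 \text{ entries)} \\
2^{nm/2} \cdot m &= 2^{6} \cdot 3 = 192 \text{ bits}
  && \text{(halved index, } \sqrt{M^{\lg N}} = 64 \text{ entries)}
\end{align*}
```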

We can actually repeat this process until we get the m-bit value we desire,
and hence we do not need the table SUM (Algorithm 6). However, this does
increase the time complexity to $O(\lg n) = O(\lg \lg N)$. This gives us a trade-off
between space and time. By allowing $O(i)$ steps for the retrieve method
we need $M^{\lg N / 2^i} \cdot m$ bits for the table.



retrieve(j)
  if (j == N-1)
    v = reg[j/2] and mask(j);
  else
    v = reg[(j+1)/2] and mask(j+1);
  i = ⌈lg n⌉;
  do
    i = i - 1;
    vnew = (v >> ((2^i)m)) + (v and ((1 << ((2^i)m)) - 1));
    v = vnew;
  while (i > 0)
  sum = v;
  if (j == N-1)
    sum = vnl + sum;
  return sum;

Algorithm 6: Retrieve in an N-m-tree stored in m-Yggdrasil memory using
no additional memory ($O(\lg \lg N)$ time).
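The repeated halving in Algorithm 6 can be sketched in Python as follows. The names `fold_sum` and `fieldwise_add_mod` are hypothetical, and the per-field loop again stands in for the word-parallel addition modulo M of Sect. 2.1; only the folding pattern itself is the point of the example.

```python
def fieldwise_add_mod(x, y, fields, m, M):
    """Add two packed words field by field modulo M (loop stand-in)."""
    field = (1 << m) - 1
    out = 0
    for level in range(fields):
        shift = level * m
        total = (((x >> shift) & field) + ((y >> shift) & field)) % M
        out |= total << shift
    return out


def fold_sum(v, n, m, M):
    """Sum (mod M) of the n m-bit fields of v using ceil(lg n) folding steps."""
    fields = 1
    while fields < n:            # round n up to a power of two
        fields *= 2
    while fields > 1:
        half = fields // 2
        top = v >> (half * m)                    # upper half of the fields
        bottom = v & ((1 << (half * m)) - 1)     # lower half of the fields
        v = fieldwise_add_mod(top, bottom, half, m, M)
        fields = half
    return v


# Fields (level 0 first) 5, 0, 7, 3 with m = 3 and M = 8.
word = 5 | (0 << 3) | (7 << 6) | (3 << 9)
print(fold_sum(word, 4, 3, 8))  # (5 + 0 + 7 + 3) mod 8
```

Stopping the folding after i steps and looking the remaining fields up in a smaller table gives the intermediate points of the space and time trade-off.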



Lemma 2 The retrieve operation of the Prefix Sum problem can be supported
in $O(i + 1)$ time using $O(M^{\lg N / 2^i} \cdot m + m)$ bits of memory in addition
to the N-m-tree. Parts of the N-m-tree are stored in m-Yggdrasil memory.

By adjusting i we can achieve the following result: 

Corollary 1 The retrieve operation of the Prefix Sum problem can be supported
in:

• $O(1)$ time using $O(M^{(\lg N)/2} \cdot m)$ bits of memory in addition to the
N-m-tree, with i = 1.

• $O(\lg \lg N)$ time using $O(m)$ bits of memory in addition to the N-m-tree,
with $i = \lceil \lg \lg N \rceil$.

2.1 Addition modulo M 

Let us consider the two m-bit operands a and b which are split into two
pieces each ($a_{lo}$, $a_{hi}$, $b_{lo}$ and $b_{hi}$). The two pieces $a_{lo}$ and $a_{hi}$ contain the
m/2 least and most significant bits of a respectively (similarly for $b_{lo}$ and
$b_{hi}$). Note that $a_{lo}$ and the other pieces are stored in m bits but only the
m/2 least significant bits are used.

We can now add the two operands

$c1_{lo} = a_{lo} + b_{lo}$ (1)
$c1_{hi} = a_{hi} + b_{hi}$. (2)

However, both $c1_{lo}$ and $c1_{hi}$ might need m/2 + 1 bits for their results. The
(m/2 + 1)st bit of $c1_{lo}$ should be added to $c1_{hi}$, so we split $c1_{lo}$ into two pieces
($c1_{lo,lo}$ and $c1_{lo,hi}$) and add the most significant bits to $c1_{hi}$,

$c_{hi} = c1_{hi} + c1_{lo,hi}$ (3)
$c_{lo} = c1_{lo,lo}$. (4)

The result of a + b is now stored in $c_{lo}$ and $c_{hi}$ and we have not used more
than m bits in any word. However, in total m + 1 bits might be needed for the
value.

To compute c mod M we can check whether or not $c - M \ge 0$; if
so $c \bmod M = c - M$ and otherwise $c \bmod M = c$. However, we do not
want to produce a negative value since that would affect all the bits in the
word. Instead we add an additional $2^m$ to the value and compare to $2^m$, i.e.
$c + 2^m - M \ge 2^m$. Since $2^m - M \ge 0$ this will never produce a negative value.
Note that $c + 2^m - M \le (M - 1) + (M - 1) + 2^m - M = M + 2^m - 2 \le 2^{m+1} - 2$,
which only needs m + 1 bits to be represented. Hence, if we calculate this value
using the strategy above we will not use more than m bits of any word.

Furthermore, a straightforward less-than comparison cannot be performed
using word-size parallelism since all bits of the words are considered.
Instead we view the comparison as a check whether the (m + 1)st bit is set
or not. If it is set the value is larger than or equal to $2^m$. We can actually
create a bit mask which consists of m 1s if the (m + 1)st bit is set and m 0s
otherwise

$d = ((c + 2^m - M) \text{ and } 2^m) - (((c + 2^m - M) \text{ and } 2^m) \gg m)$. (5)

This bit mask d can then be used to calculate res = c mod M. Since res is
equal to $c - M$ if the (m + 1)st bit of $c + 2^m - M$ is set and c otherwise we get

$\mathit{res} = ((c - M) \text{ and } d) \text{ or } (c \text{ and not } d)$. (6)

When computing $c - M$ we must make sure that we do not produce a negative
value. This is done by using a strategy similar to that for addition above, but
we also set any of the bits in $c_{hi,hi}$ to 1 during the computation. If $c - M$
is greater than or equal to 0 this will not affect the result and otherwise the
result will not be used.
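The comparison-free reduction of Eqs. (5) and (6) can be checked on a single m-bit field with the following Python sketch. It is a simplification under stated assumptions: Python integers are unbounded, so the splitting into half-words that keeps every intermediate value within m bits is omitted, and the low m bits of $c + 2^m - M$ are used in place of $c - M$ so that no negative value ever appears. The function name is hypothetical.

```python
def add_mod_branchless(a, b, m, M):
    """(a + b) mod M for 0 <= a, b < M <= 2**m, without a comparison."""
    c = a + b
    t = c + (1 << m) - M      # never negative, fits in m + 1 bits
    top = t & (1 << m)        # the (m + 1)st bit: set exactly when c >= M
    d = top - (top >> m)      # m ones if that bit is set, else 0   (Eq. 5)
    # When c >= M, the low m bits of t are exactly c - M.
    return ((t & d) | (c & ~d)) & ((1 << m) - 1)   # Eq. 6, kept to m bits


# Exhaustive check for a small field width.
assert all(
    add_mod_branchless(a, b, 3, M) == (a + b) % M
    for M in range(1, 9)
    for a in range(M)
    for b in range(M)
)
```

Because the result depends only on shifts, masks and additions within the field, the same sequence of operations can in principle be applied to all fields of a word at once.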

We have a procedure which can be used to compute (a + b) mod M
without using more than m bits in any word. Hence, word-size parallelism 
can be used and we get our main result from this section: 

Theorem 1 Using the N-m-tree together with the m-Yggdrasil memory we
can support the operations of the Prefix Sum problem in $O(i + 1)$ time using
$(N - 1)m$ bits of m-Yggdrasil memory and $O(M^{n/2^i} \cdot m + m)$ bits of ordinary
memory.

3 An O(lg N) Solution to the General and Dynamic
Prefix Sum Problem

We can actually partially solve the General Prefix Sum problem using the 
N-m-tree data structure and the m-Yggdrasil variant of RAMBO. All binary
operations such that all elements in the universe have a unique inverse
element (i.e. binary operations which form a group with the set of elements
in the universe) and only affect the m bits involved in the operation can be
supported. This includes for example addition and subtraction but not the 
maximum function. 

To solve the General and Dynamic Prefix Sum problem for semi-group
operations we modify the Binary Segment Tree (BinSeT) data structure
suggested by Brodnik and Nilsson. It was designed to handle in-advance
resource reservation [13, pp 65-80] and if it is slightly modified it can solve
both the General and Dynamic Prefix Sum problems efficiently. The original
BinSeT stores, in each internal node, $\mu$, the maximum value over the interval,
and $\delta$, the change of the value over the interval. Further, it also stores $\tau$,
the time of the leftmost event in the right subtree.

Instead of storing times as interval dividers we store array indices. To
solve the Dynamic Prefix Sum problem with addition as the operation we
only need to store $\delta$. When solving the General Prefix Sum problem
one needs to store information depending on the two binary operations $\oplus_u$
and $\oplus_r$.

When adding a new array position or deleting an array position the tree
is rebalanced (cf. [1,12]) and hence the height is always $O(\lg N)$. When
updating a value in an array position we start at the root and search for
the proper leaf using the interval dividers. During the backtracking of the
recursion we update the information stored in each affected node.

At retrieval we process the information of the proper nodes when traversing
the tree. Since the height of the tree is $O(\lg N)$ all the operations can be
performed in $O(\lg N)$ time. This matches the lower bound by Hampapuram
and Fredman [9].

BinSeT consists of $O(N)$ nodes when we use it to solve the General
Prefix Sum problem. Each node contains $O(1)$ m-bit values and hence the total
space requirement is $O(Nm)$ bits.

4 Conclusion



The Dynamic and General Prefix Sum problems can both be solved optimally
in $O(\lg N)$ time using $O(Nm)$ space under the comparison based model
with semi-group operations.

The Prefix Sum problem can be solved in $O(1)$ time under the RAMBO
model when we allow $O(\sqrt{M^{\lg N}} \cdot m)$ bits of ordinary memory and $O(Nm)$
bits of m-Yggdrasil memory to be used. This is a huge amount of ordinary
memory and if we restrict the space requirement to be subexponential in
both N and M ($O(m)$ bits of ordinary memory and $O(Nm)$ bits of m-Yggdrasil
memory) we need to use $O(\lg \lg N)$ time. We know of no better
lower bound under RAMBO than the trivial $\Omega(1)$ when only allowing
$O((N^{O(1)} + M^{O(1)})m)$ space.

Further, it is currently unknown if one can achieve an $O(1)$ solution to
the Dynamic and General Prefix Sum problems using the RAMBO model.
Another open question is whether or not it is possible to achieve an $o(\lg N)$
solution to the multidimensional variant.